Abstract

Each of two players, in turn, rolls a dice several times, accumulating the
successive scores until he decides to stop or he rolls an ace. When he stops, the
accumulated turn score is added to the player's account and the dice is given to
his opponent. If he rolls an ace, the dice is given to the opponent without adding
any points. In this paper we formulate this game in the framework of competitive
Markov decision processes (also known as stochastic games), show that the game
has a value, provide an algorithm to compute the optimal minimax strategy, and
present results of this algorithm in three different variants of the game.

Keywords: Competitive Markov processes, Stochastic games, dice games, minimax
strategy.

1 Introduction

Consider a two-player dice game in which players accumulate points by turns with
the following rules. The player who first reaches a certain fixed number of points is
the winner of the game. In his turn each player rolls the dice several times until he
decides to stop or rolls an ace. If he decides to stop, the accumulated successive
scores are added to his account; if he rolls an ace, no additional points are obtained.
As a first approach to finding optimal strategies for this game, Roters [5] studied the
optimal stopping problem corresponding to the maximisation of the expected score in
one turn. The optimal solution is a good way of reducing the number of turns required
to reach the objective.

Later, Haigh & Roters [3] found the strategy that minimises the expected number
of turns required to reach the target. This second strategy is better than the one
obtained in [5], but neither takes into account the opponent's score, which is clearly
relevant to winning the game.

In this paper we formulate this game in the framework of competitive Markov
decision processes (also known as stochastic games), show that the game has a value,
provide an algorithm to compute the optimal minimax strategy, and present results
of this algorithm in three different variants of the game.

The concept of "stochastic games" was introduced in 1953 by Shapley in [6]. In
the recent book by Filar and Vrieze [2], the authors provide a general and modern
comprehensive approach to this theory, departing from the theory of controlled Markov
processes (which can be considered "solitaire" stochastic games), and call the type of
games we are interested in competitive Markov decision processes, a denomination
that we find more accurate than the more usual denomination stochastic game.

During the preparation of this paper we found the related article by Neller and
Presser [1] where the authors, following a heuristic approach, formulate the Bellman
equation of the problem (which is a consequence of our results), and compute the
optimal strategy of a variant of this game. It must be noted that the theory of Filar
and Vrieze [2], which we follow, provides the solution of the problem in the set of all
possible strategies, including non-stationary and randomised strategies, i.e. the set
of behaviour strategies.

In section 2 we present the theory of competitive Markov decision processes,
especially in the transient case, and we conclude the section with the formulation of
the theorem we need to solve our dice game. A proof of this theorem can essentially be
found in [2]. In section 3 we determine the state space of our game, the corresponding
action spaces for each player, the payoff function of the game, and the Markov
transitions depending on each state of the process and action of the players. In section
4 we present two related games: in the first variant, in order to win, the player has to
reach the target exactly (if the target is exceeded, he gives the dice to his opponent
without changing his accumulated score); in the second variant the players aim to
maximise the difference between their scores. In section 5 we present the conclusions.

2 Competitive Markov decision processes 

A competitive Markov decision process (also known as a stochastic game) is the
mathematical model of a sequential game in which two players take actions considering
the status of a certain Markov process. Both actions determine an immediate payoff
for each player and the probability distribution of the following state of the game.
Our interest is centred on two-player, finite-state and finite-action, zero-sum games.
To define them formally we need the following ingredients:

(S) States: A finite set $S$ of the possible states of the game.

(A) Actions: For each state $s \in S$ we consider finite sets $A^s$ and $B^s$ whose
elements are the possible actions for players one and two respectively; at each step
both players take their actions simultaneously and independently.

(P) Payoffs: For each state $s \in S$ a function $r^s : A^s \times B^s \to \mathbb{R}$
determines the amount that player two has to pay to player one depending on the
actions taken by both players.

(TP) Transition probabilities: For each $(s, a, b) \in S \times A^s \times B^s$, $p_{s,a,b}$
is a probability distribution on $S$, which determines the following state of the game.

We denote by $S_t$ the state of the game at time $t = 0, 1, \ldots$; the initial state is
fixed ($S_0 = s_0$). At each step $t$ players choose actions $A_t$ and $B_t$ in $A^{S_t}$ and
$B^{S_t}$ respectively, which determine that player two has to pay to player one an
amount of $r^{S_t}(A_t, B_t)$, and that the probability distribution of $S_{t+1}$ will be
$p_{S_t,A_t,B_t}$.

The random variable $W = \sum_{t=0}^{\infty} r^{S_t}(A_t, B_t)$ is the total amount that player
two pays to player one (it could be negative). Note that $W$ depends on the way in which
players take their actions. The objective of the game for player one is to maximise the
expected value $V$ of the accumulated payoff $W$, while player two has the objective of
minimising it. In principle $V$ could be infinite. Often for economic applications the
accumulated payoff is $W_\beta = \sum_{t=0}^{\infty} \beta^t r^{S_t}(A_t, B_t)$, called "the discounted
sum", where $0 < \beta < 1$ represents the devaluation of money. This discount factor
$\beta$ ensures that $W_\beta$ is finite with probability one and that its expected value exists.
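As a brief check of this last claim, the following bound is a sketch that assumes only what the model already states, namely that the state and action sets are finite, so that $M = \max_{s,a,b} |r^s(a,b)|$ is finite (the symbol $M$ is notation introduced here, not in the original text):

```latex
|W_\beta|
  \le \sum_{t=0}^{\infty} \beta^t \,\bigl| r^{S_t}(A_t, B_t) \bigr|
  \le M \sum_{t=0}^{\infty} \beta^t
  = \frac{M}{1-\beta} < \infty .
```

The same bound shows that the expected value of $W_\beta$ exists, since a bounded random variable is integrable.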

In the case of transient stochastic games, the situation considered in this paper,
the sum defining $W$ is finite a.s. due to the fact that the process always reaches
a final state $s_f \in S$ in a finite (but not necessarily bounded) number of steps. The
definition of transient stochastic game requires additional conditions in order for
$V$ to be finite. Before the formal definition of transient stochastic games, we introduce
the concept of behaviour strategy.



2.1 Strategies 

Consider the set $\mathbb{K}$ defined by

$$\mathbb{K} = \{(s, a, b) : s \in S,\ a \in A^s,\ b \in B^s\}.$$

We define, for each $t = 0, 1, \ldots$, the sets $\mathbb{H}_t$ of admissible histories up to
time $t$ by

$$\mathbb{H}_t = \begin{cases} S & \text{if } t = 0, \\ \underbrace{\mathbb{K} \times \cdots \times \mathbb{K}}_{t \text{ times}} \times S & \text{else.} \end{cases}$$

Definition 1 (Behaviour strategy) Given a stochastic game, a behaviour strategy
for player one (two) in the game is a function $\pi$ ($\varphi$) which associates to each history

$$h = (s_0, a_0, b_0, \ldots, s) \in \bigcup_{t=0}^{\infty} \mathbb{H}_t$$

a probability distribution $\pi(\cdot|h)$ on $A^s$ (respectively $\varphi(\cdot|h)$ on $B^s$). In the
context of a stochastic game, we denote by $\Pi$ ($\Phi$) the set of all behaviour strategies
for player one (two).

Note that the previous definition is in agreement with the intuitive idea that a
player can choose his action based on the history of the game. There are two relevant
subclasses of strategies, pure and stationary, introduced below.

Definition 2 (Pure strategy) A behaviour strategy $\pi$ is said to be pure if for each
history $h$ there exists an action $a_h$ such that $\pi(a_h|h) = 1$. We could say that a pure
strategy chooses the action to be taken in a deterministic way.

Definition 3 (Stationary strategy) A behaviour strategy $\pi$ is said to be stationary
if the probability distribution $\pi(\cdot|h)$ depends only on $s$, the last state of the history.
In this case we use the notation $\pi(\cdot|s)$.

2.2 Probabilistic framework 

We now construct the probability space in which the optimisation procedure takes
place. Consider the product space

$$\Omega = \Big(S \times \bigcup_{s \in S} A^s \times \bigcup_{s \in S} B^s\Big)^{\mathbb{N}}$$

equipped with the product $\sigma$-algebra $\mathcal{F}$, defined as the minimal $\sigma$-algebra
containing the cylinder sets of $\Omega$.

Given $\omega = (s_0, a_0, b_0, s_1, a_1, b_1, \ldots) \in \Omega$, a sequence of states and actions in
the product space, the coordinate processes

$$\{S_t\}_{t=0,1,\ldots}, \quad \{A_t\}_{t=0,1,\ldots}, \quad \{B_t\}_{t=0,1,\ldots}$$

are defined by

$$S_t(\omega) = s_t, \quad A_t(\omega) = a_t, \quad B_t(\omega) = b_t.$$

In this framework, given behaviour strategies $\pi$, $\varphi$ for players one and two and an
initial state $s \in S$, it is possible to introduce a probability $\mathbb{P}_{s,\pi,\varphi}$ such that,
for the random vector

$$H_t = (S_0, A_0, B_0, S_1, A_1, B_1, \ldots, S_t),$$

and the finite sequence of states and actions $h_t = (s_0, a_0, b_0, \ldots, s_t)$, the following
assertions hold:

• the game starts in the state $s$, i.e., $\mathbb{P}_{s,\pi,\varphi}(S_0 = s) = 1$;

• with probability 1, $H_t$ takes its values in $\mathbb{H}_t$;

• the probability distribution on the actions chosen by players at time $t$ depends
on $H_t$, according to

$$\mathbb{P}_{s,\pi,\varphi}(A_t = a_t \mid H_t = h_t) = \pi(a_t|h_t),$$
$$\mathbb{P}_{s,\pi,\varphi}(B_t = b_t \mid H_t = h_t) = \varphi(b_t|h_t),$$
$$\mathbb{P}_{s,\pi,\varphi}(A_t = a_t, B_t = b_t \mid H_t = h_t) = \pi(a_t|h_t)\,\varphi(b_t|h_t);$$

• the probability distribution of $S_{t+1}$ depends only on $S_t, A_t, B_t$, and is given by
the transition probabilities (TP) of the game:

$$\mathbb{P}_{s,\pi,\varphi}(S_{t+1} = s_{t+1} \mid H_t = h_t, A_t = a_t, B_t = b_t) = p_{s_t,a_t,b_t}(s_{t+1}).$$

We denote by $E_{s,\pi,\varphi}$ the expected value in the probability space
$(\Omega, \mathcal{F}, \mathbb{P}_{s,\pi,\varphi})$.

2.3 Transient stochastic games

Definition 4 (Transient stochastic game) A stochastic game is transient when
there exists a final state $s_f \in S$ such that:

(1) $r^{s_f}(a, b) = 0$, $\forall a \in A^{s_f}$, $\forall b \in B^{s_f}$;

(2) $p_{s_f,a,b}(s_f) = 1$, $\forall a \in A^{s_f}$, $\forall b \in B^{s_f}$;

(3) for every pair of strategies $(\pi, \varphi)$ of players one and two respectively and for
every initial state $s$,

$$\sum_{t=0}^{\infty} \mathbb{P}_{s,\pi,\varphi}(S_t \neq s_f) < \infty.$$

Conditions (1) and (2) ensure that, once the game falls into the final state $s_f$, it
never changes state again and the gain of both players is zero. The third condition
ensures that the game finishes with probability one.

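Definition 5 (Value of a pair of strategies) Given $\pi \in \Pi$, $\varphi \in \Phi$ strategies for
players one and two in a transient stochastic game, the value of the strategies is a
function $V_{\pi,\varphi} : S \to \mathbb{R}$ defined by

$$V_{\pi,\varphi}(s) = E_{s,\pi,\varphi}\Big(\sum_{t=0}^{\infty} r^{S_t}(A_t, B_t)\Big).$$

Definition 6 (Optimal strategy) A behaviour strategy $\pi^*$ for player one in a
transient stochastic game is said to be optimal if

$$\inf_{\varphi \in \Phi} V_{\pi^*,\varphi}(s) = \sup_{\pi \in \Pi} \inf_{\varphi \in \Phi} V_{\pi,\varphi}(s), \quad \forall s \in S.$$

Analogously, a behaviour strategy $\varphi^*$ for player two is said to be optimal if

$$\sup_{\pi \in \Pi} V_{\pi,\varphi^*}(s) = \inf_{\varphi \in \Phi} \sup_{\pi \in \Pi} V_{\pi,\varphi}(s), \quad \forall s \in S.$$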

We now formulate the result used to solve our dice game. 

Theorem 2.1 (Value and optimal strategies) Given a transient stochastic game
the following identity is fulfilled:

$$\sup_{\pi \in \Pi} \inf_{\varphi \in \Phi} V_{\pi,\varphi}(s) = \inf_{\varphi \in \Phi} \sup_{\pi \in \Pi} V_{\pi,\varphi}(s), \quad \text{for all } s \in S. \tag{2.1}$$

The vector defined in (2.1), denoted by $(v(s))_{s \in S}$, is called the value of the game.
This value is the unique joint solution of

$$x(s) = \Big[ r^s(a, b) + \sum_{s' \in S} p_{s,a,b}(s')\, x(s') \Big]^*, \quad s \in S,$$

where $[\cdot]^*$ represents the value of the matrix game (in the minimax sense) obtained by
considering rows $a \in A^s$ and columns $b \in B^s$. Moreover, the stationary strategies $\pi$
and $\varphi$ for players one and two, such that $\pi(\cdot|s)$ and $\varphi(\cdot|s)$ are the optimal strategies
of the matrix game

$$\Big[ r^s(a, b) + \sum_{s' \in S} p_{s,a,b}(s')\, v(s') \Big]$$

for every state $s \in S$, are optimal strategies in the transient stochastic game.

Proof. This theorem is essentially a particular case of theorem 4.2.6 in [2]. A
detailed proof can be found in [1].

Remark 2.1 The previous theorem ensures the existence of optimal strategies for
both players. In particular, they lie in the subclass of stationary strategies. The
proof of this theorem considers a map $U$ defined by

$$Uv(s) = \Big[ r^s(a, b) + \sum_{s' \in S} p_{s,a,b}(s')\, v(s') \Big]^*, \tag{2.2}$$

which is an $n$-step contraction. Later on, we use the map $U$ to implement a numerical
method that finds its unique fixed point, which is the value of the game.
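To make the fixed-point idea concrete, here is a minimal sketch on a hypothetical toy game that is not part of the paper: a single non-final state in which both players choose between two actions simultaneously, after which the game either stays in that state or moves to the final state (whose value is 0). The function names, the toy payoffs and the use of the standard closed-form value of a 2x2 matrix game are all assumptions made for illustration.

```python
def value_2x2(m):
    """Minimax value of a 2x2 zero-sum matrix game (the row player maximises)."""
    (a, b), (c, d) = m
    lower = max(min(a, b), min(c, d))   # what the row player can secure with a pure row
    upper = min(max(a, c), max(b, d))   # what the column player can cap with a pure column
    if lower == upper:                  # saddle point in pure strategies
        return lower
    # No saddle point: both players randomise (standard 2x2 closed form).
    return (a * d - b * c) / (a + d - b - c)


def apply_U(x, reward, stay_prob):
    """One application of the map U in a game with a single non-final state.

    reward[a][b] is the immediate payoff, stay_prob[a][b] the probability of
    remaining in the state; otherwise the game moves to the final state.
    """
    m = [[reward[a][b] + stay_prob[a][b] * x for b in range(2)] for a in range(2)]
    return value_2x2(m)


def fixed_point(reward, stay_prob, tol=1e-12, max_iter=10_000):
    """Iterate U from x = 0 until two successive iterates differ by less than tol."""
    x = 0.0
    for _ in range(max_iter):
        new_x = apply_U(x, reward, stay_prob)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


# Toy data (hypothetical): player one is paid 1 when the actions match,
# and the game ends with probability 1/2 after every step.
toy_reward = [[1.0, 0.0], [0.0, 1.0]]
toy_stay = [[0.5, 0.5], [0.5, 0.5]]
toy_value = fixed_point(toy_reward, toy_stay)
```

In this toy game each step is worth 1/2 to player one under optimal (randomised) play and the game continues with probability 1/2, so the iteration solves $x = 1/2 + x/2$ and approaches 1.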



3 The dice game 

In this section we describe the states, actions, payoffs, and transition probabilities
(defined in section 2) corresponding to our dice game, and present the numerical
results, showing the optimal strategy for a player. This strategy, optimal in the
class of behaviour strategies, guarantees that a player wins with probability at least
1/2 regardless of the opponent's strategy. The optimal strategy is pure and stationary,
and consists of a simple rule indicating whether to roll or to stop, depending on the
scores of the player and his opponent.

3.1 Modelling the dice game 

To solve the dice game (compute optimal strategies), we model it as a transient 
stochastic game. We have to specify the set of states, possible actions, payoffs and 
transition probabilities. 

(S) States: During the dice game, there are four aspects that vary: the player $j$ who
has the dice ($j = 1, 2$), the accumulated score $\alpha$ of player one, the accumulated
score $\beta$ of player two and the turn score $\tau$ of player $j$. So we consider states
$(j, \alpha, \beta, \tau)$. We also need to consider two special states: an initial state $s_0$,
and a final state $s_f$.

Table 1: Possible actions for each player depending on the state of the game.

State                                                    Player one         Player two
$s_0$, $s_f$                                             to wait            to wait
$(1, \alpha, \beta, 0)$                                  to roll            to wait
$(1, \alpha, \beta, \tau)$, $0 < \tau < 200 - \alpha$    to roll, to stop   to wait
$(1, \alpha, \beta, \tau)$, $\tau \ge 200 - \alpha$      to stop            to wait
$(2, \alpha, \beta, 0)$                                  to wait            to roll
$(2, \alpha, \beta, \tau)$, $0 < \tau < 200 - \beta$     to wait            to roll, to stop
$(2, \alpha, \beta, \tau)$, $\tau \ge 200 - \beta$       to wait            to stop


If the score of either player is greater than or equal to 200 the game is over, and
it is then in the state $s_f$. Because of that, states $(j, \alpha, \beta, \tau)$ only make sense if
$\alpha < 200$, $\beta < 200$. The same happens if $\tau$ is big enough to reach 200 by stopping.
So the finite set $S$ of possible states is

$$S = \{s_0, s_f\} \cup S_1 \cup S_2,$$

where $S_1$ is the set of states of player one:

$$S_1 = \{(1, \alpha, \beta, \tau) : 0 \le \alpha \le 199,\ 0 \le \beta \le 199,\ 0 \le \tau \le 205 - \alpha\}$$

and $S_2$ is the set of states of player two:

$$S_2 = \{(2, \alpha, \beta, \tau) : 0 \le \alpha \le 199,\ 0 \le \beta \le 199,\ 0 \le \tau \le 205 - \beta\}.$$

(A) Actions: We have to specify the set of actions per state for each player.
Possible actions in this game are to roll and to stop. We add an extra action, to
wait, which represents that it is not the player's turn. There are some constraints
to ensure the transient condition of the stochastic game: in the states
$(1, \alpha, \beta, 0)$ with $\alpha < 200$ it does not make sense for player one to take the action
to stop, because there is nothing to lose by rolling. If our model allowed the action to
stop with 0 points in the turn ($\tau = 0$), it is easy to see that there would exist a pair
of strategies that make the game infinite. The same happens if the action to roll is
possible when stopping is enough to win. Table 1 shows the set of possible
actions per state.

(P) Payoffs: Because we want to maximise the probability of winning, we define
the payoff function in such a way that maximising the probability of winning
is equivalent to maximising the expected value $V$ of the payoffs accumulated
along the game. The model of transient stochastic game allows us to define a
payoff for each pair (state, action), but in this case it is enough to define the payoff
depending only on the state, as follows:

$$r^s = \begin{cases} 1 & \text{if } s = (1, \alpha, \beta, \tau) \text{ with } \alpha + \tau \ge 200, \\ 0 & \text{else.} \end{cases}$$

(TP) Transition probabilities: To represent the transition probabilities graphically,
the figures use a distinct graphical symbol for the states $(1, \alpha, \beta, \tau)$ and
$(2, \alpha, \beta, \tau)$.

The dynamics of the game and the semantics of the states determine the transition
probabilities between states. Figure 1 presents the transition probabilities from a
state $(1, \alpha, \beta, \tau)$ with $\alpha + \tau < 200$, depending on the player's decision.

Figure 1: Transition probabilities from a state $(1, \alpha, \beta, \tau)$ with $\alpha + \tau < 200$,
depending on the player decision (panels: to roll, to stop).

Figure 1 shows that, when the decision is to roll, the probability distribution
on the states is associated with the results of rolling a dice; in particular, the
probability of losing the turn is 1/6. In the winning states of player one (i.e.
$(1, \alpha, \beta, \tau)$ with $\alpha + \tau \ge 200$) the transition is, with probability one, to the final
state $s_f$. Transitions for player two are completely symmetric. As shown in
figure 2, in the special states $s_0$ and $s_f$ transitions do not depend on the actions
taken by players; indeed, they have no options.




Figure 2: Transition probabilities from the initial and final states. 
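The action sets of Table 1 and the transition rules just described can be sketched in code as follows. This is only an illustration under stated assumptions: the function names, the string labels `"s0"`/`"sf"` for the special states, and the use of exact fractions are choices made here, and the even split from the initial state to $(1,0,0,0)$ and $(2,0,0,0)$ is read off the equation for $Uv(s_0)$ in section 3.2.

```python
from fractions import Fraction

TARGET = 200  # winning score of the classical game


def allowed_actions(state):
    """Return (actions of player one, actions of player two), following Table 1."""
    if state in ("s0", "sf"):
        return ["wait"], ["wait"]
    j, alpha, beta, tau = state
    own = alpha if j == 1 else beta
    if tau == 0:
        mover = ["roll"]              # nothing to bank yet: rolling is forced
    elif tau < TARGET - own:
        mover = ["roll", "stop"]
    else:
        mover = ["stop"]              # stopping wins: rolling is not allowed
    return (mover, ["wait"]) if j == 1 else (["wait"], mover)


def transitions(state, action):
    """Distribution of the next state; `action` is the move of the player holding the dice."""
    if state == "s0":                 # the starting player is drawn evenly (assumption, see lead-in)
        return {(1, 0, 0, 0): Fraction(1, 2), (2, 0, 0, 0): Fraction(1, 2)}
    if state == "sf":
        return {"sf": Fraction(1)}
    j, alpha, beta, tau = state
    own = alpha if j == 1 else beta
    other = 3 - j
    if own + tau >= TARGET:           # winning state: go to the final state
        return {"sf": Fraction(1)}
    if action == "stop":              # bank the turn score and pass the dice
        if j == 1:
            return {(other, alpha + tau, beta, 0): Fraction(1)}
        return {(other, alpha, beta + tau, 0): Fraction(1)}
    # action == "roll": an ace passes the dice, faces 2..6 add to the turn score
    dist = {(other, alpha, beta, 0): Fraction(1, 6)}
    for face in range(2, 7):
        dist[(j, alpha, beta, tau + face)] = Fraction(1, 6)
    return dist
```

For instance, `transitions((1, 10, 20, 5), "roll")` gives probability 1/6 to each of the six outcomes, one of which is losing the turn and moving to `(2, 10, 20, 0)`.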

Now we verify that the stochastic game defined above is transient, that is, that
$s_f$ satisfies conditions (1), (2) and (3) in definition 4. Conditions (1) and (2) are
trivially fulfilled; it remains to verify that

$$\sum_{t=0}^{\infty} \mathbb{P}_{s,\pi,\varphi}(S_t \neq s_f) < \infty$$

for every initial state $s$ and every pair of strategies $\pi$, $\varphi$.

Due to the fact that at the beginning of a turn the only option is to roll, and that
in a state in which the accumulated score is enough to win the player has to stop, the
game cannot continue indefinitely.

For example, if a 6 is rolled 70 times consecutively it is impossible to avoid reaching
the final state. Denoting by $\gamma$ the probability of rolling a 6 seventy times in a row, i.e.
$\gamma = 1/6^{70} > 0$, it is easy to see that $\gamma$ is a lower bound for the probability of
$S_t = s_f$ for $t \ge 70$. By a similar argument, it follows that, for $n = 0, 1, \ldots$,

$$\mathbb{P}_{s,\pi,\varphi}(S_t \neq s_f) \le (1 - \gamma)^n \quad \text{if } 70n \le t < 70(n + 1).$$

Then

$$\sum_{t=0}^{\infty} \mathbb{P}_{s,\pi,\varphi}(S_t \neq s_f) \le 70 \sum_{n=0}^{\infty} (1 - \gamma)^n < \infty$$

and our model is transient.
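For completeness, the last bound can be made explicit. The geometric series has ratio $1 - \gamma \in [0, 1)$, so, writing the probability under strategies $\pi$, $\varphi$ and initial state $s$ as $\mathbb{P}_{s,\pi,\varphi}$ (notation assumed here for the garbled symbol), one obtains:

```latex
\sum_{n=0}^{\infty} (1-\gamma)^n = \frac{1}{1-(1-\gamma)} = \frac{1}{\gamma},
\qquad\text{hence}\qquad
\sum_{t=0}^{\infty} \mathbb{P}_{s,\pi,\varphi}(S_t \neq s_f)
  \le \frac{70}{\gamma} = 70 \cdot 6^{70} .
```

The bound is astronomically large, but the definition of a transient game only requires it to be finite.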



3.2 Numerical results 



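In this section the results of theorem 2.1 are applied to the particular case of the
transient stochastic game defined above. Rewriting the definition of the map $U$,
defined in equation (2.2), we obtain:

$$Uv(s) = \begin{cases}
\frac{1}{2} v(1, 0, 0, 0) + \frac{1}{2} v(2, 0, 0, 0) & \text{if } s = s_0, \\
v(1, \alpha, \beta, 0)_{roll} & \text{if } s = (1, \alpha, \beta, 0), \\
\max\{v(1, \alpha, \beta, \tau)_{stop},\ v(1, \alpha, \beta, \tau)_{roll}\} & \text{if } s = (1, \alpha, \beta, \tau),\ \alpha + \tau < 200, \\
1 & \text{if } s = (1, \alpha, \beta, \tau),\ \alpha + \tau \ge 200, \\
v(2, \alpha, \beta, 0)_{roll} & \text{if } s = (2, \alpha, \beta, 0), \\
\min\{v(2, \alpha, \beta, \tau)_{stop},\ v(2, \alpha, \beta, \tau)_{roll}\} & \text{if } s = (2, \alpha, \beta, \tau),\ \beta + \tau < 200, \\
0 & \text{in other case,}
\end{cases}$$

where

$$v(1, \alpha, \beta, \tau)_{roll} = \tfrac{1}{6} v(2, \alpha, \beta, 0) + \tfrac{1}{6} \sum_{k=2}^{6} v(1, \alpha, \beta, \tau + k),$$
$$v(1, \alpha, \beta, \tau)_{stop} = v(2, \alpha + \tau, \beta, 0),$$
$$v(2, \alpha, \beta, \tau)_{roll} = \tfrac{1}{6} v(1, \alpha, \beta, 0) + \tfrac{1}{6} \sum_{k=2}^{6} v(2, \alpha, \beta, \tau + k),$$
$$v(2, \alpha, \beta, \tau)_{stop} = v(1, \alpha, \beta + \tau, 0).$$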



Note that in the equations above we have replaced the value of the matrix games
in equation (2.2) by a maximum in the states in which player one has to take the
decision, and by a minimum when player two is the one who has to take it. In the
states in which both players have only one choice, the value of the matrix game is
the only entry of the matrix. Since there are no states in which both players have to
decide simultaneously, the stationary strategy that emerges from the theorem turns
out to be pure, i.e. each player takes an action with probability 1. To determine
the complete solution it is necessary to specify which action should be taken in about
4 000 000 states. Figure 3 shows the optimal strategy for player one for some states.
The complete solution can be found at

www.cmat.edu.uy/cmat/docentes/fabian/documentos/optimalstrategy.pdf.

Figure 3: Part of the optimal strategy for player one (for player two it is symmetric).
It includes states with opponent score $\beta$ = 0, 150, 180 & 185. In the gray zone the
optimal action is to roll and in the black zone it is to stop.

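The iteration of the map $U$ for the dice game can be sketched as below. This is not the authors' implementation: the function name `solve_dice_game`, its parameters and the dictionary-based state storage are assumptions for illustration, and the pure-Python loop is only practical for small targets (the full game with target 200 has millions of states). Each sweep applies $U$ synchronously to all non-terminal states, starting from $v = 0$.

```python
def solve_dice_game(target, tol=1e-12, max_sweeps=100_000):
    """Approximate the value v(j, alpha, beta, tau) by iterating the map U.

    Returns a dict mapping each non-terminal state (j, alpha, beta, tau) to the
    probability that player one wins under optimal play by both players.
    """
    states = [(j, a, b, t)
              for j in (1, 2)
              for a in range(target)
              for b in range(target)
              for t in range(target - (a if j == 1 else b))]
    v = dict.fromkeys(states, 0.0)

    def val(j, a, b, t):
        own = a if j == 1 else b
        if own + t >= target:            # the player holding the dice has won
            return 1.0 if j == 1 else 0.0
        return v[(j, a, b, t)]

    for _ in range(max_sweeps):
        new_v = {}
        for (j, a, b, t) in states:
            other = 3 - j
            # Rolling: an ace passes the dice, faces 2..6 add to the turn score.
            roll = (val(other, a, b, 0)
                    + sum(val(j, a, b, t + k) for k in range(2, 7))) / 6
            if t == 0:                   # stopping is not allowed with tau = 0
                new_v[(j, a, b, t)] = roll
                continue
            # Stopping: bank the turn score and pass the dice.
            stop = val(other, a + t, b, 0) if j == 1 else val(other, a, b + t, 0)
            # Player one maximises, player two minimises.
            new_v[(j, a, b, t)] = max(stop, roll) if j == 1 else min(stop, roll)
        delta = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if delta < tol:
            break
    return v
```

As a sanity check on a tiny instance, with `target=2` any face from 2 to 6 wins immediately, so the value for the player about to roll solves $x = 5/6 + x/36$, that is $x = 6/7$; `solve_dice_game(2)[(1, 0, 0, 0)]` should agree with this up to the tolerance. The optimal action in a state with $\tau > 0$ can then be read off by comparing the `stop` and `roll` quantities computed from the returned values.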
Some observations about the solution: 

• At the beginning of the game, when both players have low scores, the optimal
action is to roll when $\tau < 20$ and to stop otherwise, following the strategy found
by Roters [5], which maximises the expected value of a turn score. The heuristic
interpretation of this fact is: when far away from the target it is optimal to
approach it in steps as large as possible.

• When the opponent's score $\beta$ becomes larger the optimal strategy becomes riskier.
This can be explained by the fact that there are fewer turns left to reach the target.

• For opponent scores greater than or equal to 187 ($\beta \ge 187$), the graphic becomes
entirely gray. In other words, when the opponent is close to winning, giving him
the dice is a bad idea.

To compare the optimal strategy with the one found by Haigh & Roters [3], we
simulated 10 000 000 games. Our simulation showed that in 52% of the games
the winner was the player with the optimal strategy.
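The threshold of 20 mentioned in the first observation can be motivated by a one-roll look-ahead. The sketch below is a heuristic check, not Roters' derivation in [5]; the function name is hypothetical. It computes the expected change in the turn score from rolling once more and then stopping: faces 2 to 6 add their value, while an ace forfeits the current turn score $\tau$.

```python
from fractions import Fraction


def expected_gain_of_one_more_roll(turn_score):
    """Expected change in turn score from one extra roll followed by stopping."""
    gain = sum(Fraction(k, 6) for k in range(2, 7))   # faces 2..6 add k points
    loss = Fraction(turn_score, 6)                     # an ace loses the turn score
    return gain - loss


# The expected gain equals (20 - tau) / 6: positive below 20, zero at 20.
gains = {tau: expected_gain_of_one_more_roll(tau) for tau in (19, 20, 21)}
```

The expected gain is positive exactly when $\tau < 20$, which matches the rule "roll when $\tau < 20$ and stop otherwise".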



4 Two related games 

4.1 Reaching exactly the target 

It is interesting to explore how the optimal strategy changes when the game is
modified. In this section we consider the same dice game, with the only difference being
that the condition to win is to reach exactly 200 points. If the sum of accumulated
and turn score is greater than 200 the turn finishes without capitalising any
points. The formulation of the game is quite similar; the difference appears when
the accumulated score plus the turn score is greater than 194, a situation in which one
roll of the dice can exceed the target. As an example of this difference,
figure 4 shows the transition probabilities when the accumulated score is 180
and the turn score is 16 (180 + 16 > 194). Note that the probability of losing the
turn is the probability of rolling a 1, 5 or 6.

Figure 4: Example of transition probability in the variant of the game presented in
section 4.1 (action: to roll).

Figure 5 shows part of the optimal strategy for this variant of the game. The
complete optimal strategy is available at
www.cmat.edu.uy/cmat/docentes/fabian/documentos/optimalexactly.pdf.
Some remarks about the solution:

• As in the classical game, when the target is far, the strategy is similar to "stop
if $\tau \ge 20$".

• Unlike the classical game, the optimal strategy in this case never becomes as
risky. This is easy to understand because the probability of winning with any
single roll is at most 1/6, even when very close to the target.

• When $\alpha + \tau = 194$ there is a "roll zone" larger than usual, because 194 is the
largest score at which there is no risk of exceeding the target in one roll while it is
still possible to win by rolling a 6.

Figure 5: Part of the optimal strategy for player one (for player two it is symmetric)
for the game presented in subsection 4.1. It includes states with opponent score
$\beta$ = 0, 150, 180 & 198. In the gray zone the optimal action is to roll and in the black
zone it is to stop.
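The example with accumulated score 180 and turn score 16 can be reproduced with a short classification of the six faces. This is a sketch under the rules stated in this subsection; the function name and the outcome labels are hypothetical.

```python
from fractions import Fraction


def roll_outcomes_exact(total, target=200):
    """Classify one roll when `total` is the accumulated score plus the turn score."""
    outcomes = {"lose turn": Fraction(0), "win": Fraction(0), "continue": Fraction(0)}
    for face in range(1, 7):
        if face == 1 or total + face > target:
            outcomes["lose turn"] += Fraction(1, 6)   # ace, or target exceeded
        elif total + face == target:
            outcomes["win"] += Fraction(1, 6)         # target reached exactly
        else:
            outcomes["continue"] += Fraction(1, 6)    # still below the target
    return outcomes


example = roll_outcomes_exact(180 + 16)
```

For a total of 196 the faces 1, 5 and 6 lose the turn (probability 1/2), a 4 wins (probability 1/6) and a 2 or 3 keeps the turn alive. For a total of 194 only the ace loses the turn and a 6 wins, which is the situation behind the larger "roll zone" noted in the last remark.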

4.2 Maximising the expected difference 

In the second variant of the game, the winner, when reaching the target, obtains from
the loser the difference between the target and the loser's score.

Again, the model of the game is very similar to the classical model; the only
difference is the payoff function:

$$r^s = \begin{cases} 200 - \beta, & \text{if } s = (1, \alpha, \beta, \tau) \text{ with } \alpha + \tau \ge 200, \\ \alpha - 200, & \text{if } s = (2, \alpha, \beta, \tau) \text{ with } \beta + \tau \ge 200, \\ 0, & \text{else.} \end{cases}$$

Figure 6 shows the optimal strategy for some opponent scores. The complete optimal
strategy can be found at

www.cmat.edu.uy/cmat/docentes/fabian/documentos/optimalmaxdif.pdf.

The main difference with the optimal strategy in the classical case is that when a
player is close to winning (taking into account his current turn score), he takes the
risk of rolling; this feature is observed for any score of the opponent.

Figure 6: Part of the optimal strategy for player one (for player two it is symmetric)
for the variant presented in subsection 4.2. It includes states with opponent score
$\beta$ = 0, 150, 170 & 180. In the gray zone the optimal action is to roll and in the black
zone it is to stop.
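The payoff function of this variant translates directly into code. The sketch below assumes states are represented as tuples $(j, \alpha, \beta, \tau)$; the function name is hypothetical.

```python
def payoff_max_difference(state, target=200):
    """Payoff to player one in the variant that maximises the score difference."""
    j, alpha, beta, tau = state
    if j == 1 and alpha + tau >= target:
        return target - beta      # player one wins and collects 200 - beta
    if j == 2 and beta + tau >= target:
        return alpha - target     # player two wins: player one pays 200 - alpha
    return 0


example_payoffs = [
    payoff_max_difference((1, 190, 150, 12)),   # player one wins against 150 points
    payoff_max_difference((2, 120, 195, 6)),    # player two wins against 120 points
]
```

Because the payoff now depends on the loser's score, a player who is about to win still has a reason to care about the opponent's total, which is consistent with the change in the optimal strategy described above.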



5 Conclusion 

In this paper we model a dice game in the framework of competitive Markov decision
processes (also known as stochastic games) in order to obtain optimal strategies for
a player. Our main results are the proof of the existence of a value and an optimal
minimax strategy for the game, and the proposal of an algorithm to find this strategy.
We base our results on the theory of transient stochastic games presented by Filar and
Vrieze in [2].

Previous mathematical treatments of this problem include the solution of the
optimal stopping problem for a player who wants to maximise the expected number
of points in a single turn (see Roters [5]) and the minimisation of the expected number
of turns required to reach a target (see Haigh and Roters [3]). Another previous
contribution was made by Neller and Presser [1], who found the optimal strategy in
the set of stationary pure strategies, departing from a Bellman equation.

We also provide an algorithm to compute this optimal strategy explicitly (it
coincides with the optimal strategy in the larger class of behaviour strategies) and
show how this algorithm works in three different variants of the game.